Phyloseq tutorial

1 Aim

This document explains the use of the phyloseq R library to analyze metabarcoding data.

3 Data

Carbom cruise

Carbom cruise

This tutorial uses a reduced metabarcoding dataset obtained by C. Ribeiro and A. Lopes dos Santos. This dataset originates from the CARBOM cruise in 2013 off Brazil and corresponds to the 18S V4 region amplified on flow cytometry sorted samples (see pptx file for details) and sequenced on an Illumina run 2*250 bp analyzed with mothur.

3.1 References for data

  • Gérikas Ribeiro, C., Lopes dos Santos, A., Marie, D., Helena Pellizari, V., Pereira Brandini, F., and Vaulot, D. (2016). Pico and nanoplankton abundance and carbon stocks along the Brazilian Bight. PeerJ 4, e2587. doi:10.7717/peerj.2587.
  • Gérikas Ribeiro, C., Marie, D., Lopes dos Santos, A., Pereira Brandini, F., and Vaulot, D. (2016). Estimating microbial populations by flow cytometry: Comparison between instruments. Limnol. Oceanogr. Methods 14, 750–758. doi:10.1002/lom3.10135.
  • Gérikas Ribeiro C, Lopes dos Santos A, Marie D, Brandini P, Vaulot D. (2018). Relationships between photosynthetic eukaryotes and nitrogen-fixing cyanobacteria off Brazil. ISME J in press.
Carbom paper in ISME

Carbom paper in ISME

4 To be added

  • Exercices

6 Gettin started

  • Transfer the files that you downloaded from https://github.com/vaulot/R_tutorials/archive/master.zip to a directory on the computer or server you are using “C:/R_tutorial” or “home/R_tutorial”

  • Navigate to the /R_tutorial/phyloseq directory. You should see the following files :

Phyloseq_tutorial.Rmd
  • Double click on the file and this should open R Studio
R studio

R studio

  • Run the “R chunks”

7 Step by step

7.1 Load necessary libraries

## Warning: package 'ggplot2' was built under R version 3.5.1
## Warning: package 'dplyr' was built under R version 3.5.1

7.2 Read the data and create phyloseq objects

Change your working directory to where the files are located

Three tables are needed

  • OTU
  • Taxonomy
  • Samples

They are read from a single Excel file where each sheet contains one of the tables

Table OTU - OTU abundance

Table OTU - OTU abundance

Table Taxo - OTU taxonomy

Table Taxo - OTU taxonomy

Table Samples

Table Samples

Phyloseq objects need to have row.names

  • define the row names from the otu column
  • remove the column otu since it is now used as a row name
  • Idem for the two other matrixes

Transform into matrixes otu and tax tables (sample table can be left as data frame)

Transform to phyloseq objects

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 287 taxa and 55 samples ]
## sample_data() Sample Data:       [ 55 samples by 27 sample variables ]
## tax_table()   Taxonomy Table:    [ 287 taxa by 7 taxonomic ranks ]

Visualize data

##  [1] "X10n"   "X10p"   "X11n"   "X11p"   "X120n"  "X120p"  "X121n" 
##  [8] "X121p"  "X122n"  "X122p"  "X125n"  "X125p"  "X126n"  "X126p" 
## [15] "X127n"  "X13n"   "X13p"   "X140n"  "X140p"  "X141n"  "X141p" 
## [22] "X142n"  "X142p"  "X155n"  "X155p"  "X156n"  "X156p"  "X157n" 
## [29] "X157p"  "X15n"   "X15p"   "X165n"  "X165p"  "X166n"  "X166p" 
## [36] "X167n"  "X167p"  "X1n"    "X1p"    "X2n"    "X2p"    "X3n"   
## [43] "X3p"    "X5n"    "X5p"    "X7n"    "X7p"    "X9n"    "X9p"   
## [50] "tri01n" "tri01p" "tri02n" "tri02p" "tri03n" "tri03p"
## [1] "Domain"     "Supergroup" "Division"   "Class"      "Order"     
## [6] "Family"     "Genus"
##  [1] "fraction"          "Select_18S_nifH"   "total_18S"        
##  [4] "total_16S"         "total_nifH"        "sample_number"    
##  [7] "transect"          "station"           "depth"            
## [10] "latitude"          "longitude"         "picoeuks"         
## [13] "nanoeuks"          "bottom_depth"      "level"            
## [16] "transect_distance" "date"              "time"             
## [19] "phosphates"        "silicates"         "ammonia"          
## [22] "nitrates"          "nitrites"          "temperature"      
## [25] "fluorescence"      "salinity"          "sample_label"

Keep only samples to be analyzed

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 287 taxa and 54 samples ]
## sample_data() Sample Data:       [ 54 samples by 27 sample variables ]
## tax_table()   Taxonomy Table:    [ 287 taxa by 7 taxonomic ranks ]

Keep only photosynthetic taxa

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 205 taxa and 54 samples ]
## sample_data() Sample Data:       [ 54 samples by 27 sample variables ]
## tax_table()   Taxonomy Table:    [ 205 taxa by 7 taxonomic ranks ]

Normalize number of reads in each sample using median sequencing depth.

The number of reads used for normalization is 44903.

7.4 Heatmaps

A basic heatmap using the default parameters.

## Warning: Transformation introduced infinite values in discrete y-axis

It is very very cluttered. It is better to only consider the most abundant OTUs for heatmaps. For example one can only take OTUs that represent at least 20% of reads in at least one sample. Remember we normalized all the sampples to median number of reads (total). We are left with only 33 OTUS which makes the reading much more easy.

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 33 taxa and 54 samples ]
## sample_data() Sample Data:       [ 54 samples by 27 sample variables ]
## tax_table()   Taxonomy Table:    [ 33 taxa by 7 taxonomic ranks ]
## OTU Table:          [8 taxa and 5 samples]
##                      taxa are rows
##         X10n  X10p  X11p X120n X120p
## Otu001 13339  7346  3804 12662     3
## Otu002    18  8329 14958    30 36206
## Otu003  9692 10488    20 16537    11
## Otu004  3584  4943    33     7     9
## Otu005     0     6    11     0     5
## Otu006     0     9     0     0     5
## Otu007  4473   605   587  5894     3
## Otu008     1     9  6707     2    17
## Warning: Transformation introduced infinite values in discrete y-axis

It is possible to use different distances and different multivaraite methods. For example Jaccard distance and MDS and label OTUs with Class, order by Class. We can also change the Palette (the default palette is a bit ugly…).

Many different built-in distances can be used

##     UniFrac1     UniFrac2        DPCoA          JSD     vegdist1 
##    "unifrac"   "wunifrac"      "dpcoa"        "jsd"  "manhattan" 
##     vegdist2     vegdist3     vegdist4     vegdist5     vegdist6 
##  "euclidean"   "canberra"       "bray" "kulczynski"    "jaccard" 
##     vegdist7     vegdist8     vegdist9    vegdist10    vegdist11 
##      "gower"   "altGower"   "morisita"       "horn"  "mountford" 
##    vegdist12    vegdist13    vegdist14    vegdist15   betadiver1 
##       "raup"   "binomial"       "chao"        "cao"          "w" 
##   betadiver2   betadiver3   betadiver4   betadiver5   betadiver6 
##         "-1"          "c"         "wb"          "r"          "I" 
##   betadiver7   betadiver8   betadiver9  betadiver10  betadiver11 
##          "e"          "t"         "me"          "j"        "sor" 
##  betadiver12  betadiver13  betadiver14  betadiver15  betadiver16 
##          "m"         "-2"         "co"         "cc"          "g" 
##  betadiver17  betadiver18  betadiver19  betadiver20  betadiver21 
##         "-3"          "l"         "19"         "hk"        "rlb" 
##  betadiver22  betadiver23  betadiver24        dist1        dist2 
##        "sim"         "gl"          "z"    "maximum"     "binary" 
##        dist3   designdist 
##  "minkowski"        "ANY"

You can also buid your own distances.

For vectors x and y the “quadratic” terms are J = sum(x*y), A = sum(x^2), B = sum(y^2) and “minimum” terms are J = sum(pmin(x,y)), A = sum(x) and B = sum(y), and “binary” terms are either of these after transforming data into binary form (shared number of species, and number of species for each row). Somes examples :

  • A+B-2*J “quadratic” squared Euclidean

  • A+B-2*J “minimum” Manhattan

  • (A+B-2*J)/(A+B) “minimum” Bray-Curtis

  • (A+B-2*J)/(A+B) “binary” Sørensen

  • (A+B-2*J)/(A+B-J) “binary” Jaccard

Another strategy is to do a heatmap for a specific taxonomy group.

For example we can taget the Chlorophyta and then label the OTUs using the Genus.

## Warning: Transformation introduced infinite values in discrete y-axis

7.5 Alpha diversity

Plot Chao1 richness estimator and Shannon diversity estimator.

## Warning: Removed 54 rows containing missing values (geom_errorbar).

Regroup together samples from the same fraction.

7.6 Ordination

Do multivariate analysis based on Bray-Curtis distance and NMDS ordination.

## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.2317058 
## Run 1 stress 0.2449061 
## Run 2 stress 0.2503826 
## Run 3 stress 0.240618 
## Run 4 stress 0.2500941 
## Run 5 stress 0.2406116 
## Run 6 stress 0.2536113 
## Run 7 stress 0.250753 
## Run 8 stress 0.403967 
## Run 9 stress 0.2335593 
## Run 10 stress 0.254205 
## Run 11 stress 0.2538757 
## Run 12 stress 0.2477019 
## Run 13 stress 0.230867 
## ... New best solution
## ... Procrustes: rmse 0.1026009  max resid 0.3862903 
## Run 14 stress 0.2661824 
## Run 15 stress 0.2335262 
## Run 16 stress 0.2398248 
## Run 17 stress 0.258742 
## Run 18 stress 0.249438 
## Run 19 stress 0.2408127 
## Run 20 stress 0.2573136 
## *** No convergence -- monoMDS stopping criteria:
##     20: stress ratio > sratmax

Plot OTUs

A bit confusing, so make it more easy to visualize by breaking according to taxonomic division.

Now display samples and enlarge the points to make it more easy to read.

Display both samples and OTUs but in 2 different panels.

Daniel Vaulot

23 11 2018